Abstract: With an eye towards human-centered automation,
we contribute to the development of a systematic means to infer
features of human decision-making from behavioral data. Motivated
by the common use of softmax selection in models of human
decision-making, we study the maximum likelihood parameter
estimation problem for softmax decision-making models with
linear objective functions. We present conditions under which the
likelihood function is convex. These allow us to provide sufficient
conditions for convergence of the resulting maximum likelihood
estimator and to construct its asymptotic distribution. In the
case of models with nonlinear objective functions, we show how
the estimator can be applied by linearizing about a nominal
parameter value. We apply the estimator to fit the stochastic
UCL (Upper Credible Limit) model of human decision-making to
human subject data. We show statistically significant differences
in behavior across related, but distinct, tasks.

Note to Practitioners: We propose and demonstrate
a rigorous method to estimate parameters of softmax
decision-making models. These decision-making models hold
great promise for use in developing model-based human-centered
automation. We are motivated by the recently derived UCL
(Upper Credible Limit) model, which predicts the choices that
humans are likely to make when deciding among alternatives
with uncertain rewards. Key parameters of the model represent
the human’s intuition about the task, and estimating these
parameters from behavioral data would allow an automated
system to learn about its human supervisor. Our parameter
estimation method is fast enough to be implemented in real
time for most scenarios, although our analysis of the method
holds when the model has a particular linear structure. We show
how to extend the method to a more general nonlinear model
using linearization, and we show that the linearization approach
works for the motivating UCL model. The parameter estimation
method with linearization can be used for other nonlinear models;
however, the domain of its validity may vary.

Index Terms: Estimation, Automation, Decision-Making

I. Introduction 

In a variety of decision-making scenarios an agent selects
one among a discrete set of options $i \in \{1, \ldots, m\}$ and
receives a reward associated with the selection. The agent’s
goal is to make a selection or a sequence of selections to
maximize reward. For example, a human air traffic controller
selects among options for allocating aircraft for takeoff, and
the reward is a measure of efficiency of flight departures
associated with the selected option [22]. Often the decision-making
task is challenging, especially when there is uncertainty or
there are complex dependencies associated with options and
rewards, as in the air traffic control example. In this paper
we propose a rigorous method to estimate features of human
decision-making that can be used to enable human-centered
automation.

Much research has gone into studying how humans decide
among options and what conditions lead to good decision-making
performance. In this research, decision-making models
are used together with empirical data. One common approach
is to derive a decision-making model as the solution of an
optimization problem. An objective function $Q_i$ is defined for
each option $i$, and the model agent selects the option $i^*$ that
maximizes the objective function:

$$ i^* = \arg\max_i Q_i. $$

The maximum operation is deterministic and non-differentiable,
so for many applications it is replaced
by the so-called ‘softmax’ operation, in which option $i$ is
chosen with probability

$$ \Pr[i] = \frac{\exp(Q_i)}{\sum_{j=1}^{m} \exp(Q_j)}. $$

The softmax operation, which we adopt in this paper, is a
stochastic, biologically-plausible approximation of the maximum
operation [?]. Furthermore, it is differentiable with
respect to its argument $Q_i$, which makes it more analytically
tractable.
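
As a concrete illustration of the softmax operation, the following minimal Python sketch computes the choice probabilities for a list of objective values and samples one choice. The function names `softmax` and `choose` are illustrative and do not come from the paper; subtracting the maximum before exponentiating is a common numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random

def softmax(q):
    """Return softmax probabilities for a list of objective values q."""
    q_max = max(q)  # shifting by the max avoids overflow and does not change the result
    w = [math.exp(v - q_max) for v in q]
    total = sum(w)
    return [v / total for v in w]

def choose(q, rng=random):
    """Sample one option index (0-based) with softmax probabilities."""
    p = softmax(q)
    return rng.choices(range(len(q)), weights=p, k=1)[0]

# Hypothetical objective values for three options.
probs = softmax([1.0, 2.0, 3.0])
pick = choose([1.0, 2.0, 3.0], random.Random(0))
```

Options with larger objective values receive larger probabilities, but every option keeps a nonzero chance of being selected, which is the stochastic relaxation of the maximum described above.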

In contexts such as inverse reinforcement learning [?],
[?] and neuroscience [?], a central goal is to understand
the decision-making process by finding the objective function
values $\{Q_i\}$ that explain observed decisions. In this paper, we
consider this problem in the case that each objective function
value $Q_i$ is linear in a set of known variables $x_i$, i.e.,

$$ Q_i = \theta^T x_i, \quad \theta, x_i \in \mathbb{R}^{n_{\mathrm{obj}}}. \tag{1} $$

Models of this form are often used in studies of human
decision-making behavior, e.g., [?], [?], [?], [?], and are
therefore of interest in developing principled methods for
human-centered automation. By assuming the functional form
(1), we reduce the problem of finding the objective function
values to that of learning the vector of parameters $\theta$, which
we assume to be constant across options and decisions. We
call the reduced problem the parameter estimation problem
for softmax decision-making models with linear objective
functions.

The problem of learning the objective function that can
explain observed decision-making behavior is relevant for several
different disciplines. In the behavioral sciences, it is often of
interest to develop models that quantify the various factors
that contribute to the decision-making process. Similarly, in
engineering, system identification seeks to develop models of
dynamic systems that can be used for engineering design. In
either case the problem is generally solved in two steps. The
first step is to determine which variables affect the process
or system in question. In the context of this paper, this is
equivalent to determining the variables $x_i$ in Equation (1).
The second step is to quantify the effect of each variable on
the system. This is equivalent to learning the value of the
parameters $\theta$ in (1), i.e., solving the parameter estimation
problem. We call the two-step process fitting. This paper
develops an estimator with rigorous performance guarantees
for the softmax decision-making model. This provides a tool
for the second step in the fitting process.

For human-centered automation, a key goal is to develop
systems that infer the intuition or the intent of a human
operator. One approach is to posit a decision-making model
with parameters representing intuition (or intent) and fit the
model to observed human choice data. The estimator developed
in the present paper makes this possible when applied
to an appropriate decision-making model. We demonstrate
the estimator using an algorithmic model of human
decision-making in a spatial search task, derived in [24]. The model,
called the stochastic UCL (Upper Credible Limit) model, was
derived for multi-armed bandit tasks in a Bayesian setting
and was shown to qualitatively reproduce observed human
behavior from experiments. We use our estimator to infer from
these data the human decision-maker’s intuition in terms of a
set of prior beliefs about the task. The estimator is applicable
to a more general class of decision-making tasks that use a
softmax decision-making model.

As a motivating example of the softmax model, consider the
case of deciding between $m = 2$ options, each with a single
($n_{\mathrm{obj}} = 1$) known variable $x_i$, $i = 1, 2$, representing the
value of the option, and a scalar parameter $\theta$. Then the probability
of picking option 1 is

$$ \Pr[\text{pick option 1}] = \frac{1}{1 + \exp(-\theta (x_1 - x_2))}. \tag{2} $$

Figure 1 plots the probability (2) as a function of the difference
in value of the two options $\Delta x = x_1 - x_2$. When the values
of the two options are identical, the probability is equal to 0.5
and it increases monotonically with increasing $\Delta x$. The rate
of the increase is controlled by $\theta$, which sets the slope of the
function at $\Delta x = 0$. Large values of $\theta$ increase the slope and
make the choice represented by (2) discriminate between $x_1$
and $x_2$ with more sensitivity, while small values of $\theta$ decrease
the slope and make the choice less sensitive to $\Delta x$. Models of
this form have been used to study a variety of decision-making
tasks [?], [?], [?], [?], [?], where finding the value of $\theta$
that explains a given set of decisions is an important problem.
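
To see where the logistic form comes from, one can specialize the softmax rule to two options with objective values $\theta x_1$ and $\theta x_2$. The short derivation below is a standard calculation added for illustration; it also makes explicit how $\theta$ sets the slope at $\Delta x = 0$.

```latex
\Pr[\text{pick option 1}]
  = \frac{e^{\theta x_1}}{e^{\theta x_1} + e^{\theta x_2}}
  = \frac{1}{1 + e^{-\theta (x_1 - x_2)}},
\qquad
\left.\frac{d}{d\,\Delta x}\,\frac{1}{1 + e^{-\theta \Delta x}}\right|_{\Delta x = 0}
  = \frac{\theta}{4}.
```

For instance, $\theta = 4$ gives a slope of 1 at $\Delta x = 0$, and doubling $\theta$ doubles that slope.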

The parameter estimation problem for softmax decision-making
models is related to other problems previously studied
in the literature, in particular, the multinomial logistic
regression problem [?], [?] and the conditional log likelihood model
learning problem [?]. With the linear functional form (1),
the softmax decision-making model and the conditional log
likelihood model are formally equivalent, meaning that the
parameter estimation problem has been studied in previous
work, e.g., [?]. The novelty of the present paper comes in
the application of parameter estimators to a formal model of
human decision-making (the stochastic UCL model) and its
use in quantifying a human subject’s intuition about a
decision-making task.

Fig. 1. The probability from the model (2) with $m = 2$ options and
a scalar ($n_{\mathrm{obj}} = 1$) parameter $\theta$. The probability of picking option 1 is a
logistic function of $\Delta x = x_1 - x_2$ and the sensitivity to $\Delta x$ is controlled
by $\theta$, which sets the slope at $\Delta x = 0$.

The stochastic UCL model for human decision-making in
spatial search tasks [24] is a softmax decision model with an
objective function $Q^{\mathrm{UCL}}$ that is a nonlinear function of several
parameters. We show how $Q^{\mathrm{UCL}}$ can be transformed into a
linear function of the form (1) by linearizing about a point in
parameter space.

We adopt a maximum likelihood approach to parameter
estimation. In this framework, the convexity of a model implies
asymptotic convergence of estimators and that the estimation
problem is a convex optimization problem. The convexity of
the conditional log likelihood model is an accepted fact in the
natural language processing literature [?], so we do not focus
on it here. We apply the standard optimization algorithms to
the stochastic UCL estimation problem and demonstrate our
results.

There are two major contributions of this paper. First, we
show how to apply standard parameter estimation techniques
to the stochastic UCL model, a rigorously-derived model of
human choice behavior. Models with a similar softmax
functional form are commonly used in the neuroscience literature
to model choice behavior and are likely to be widely applicable
to the field of human-centered automation. Estimating the
parameters of such models provides a method to quantify
human intention and intuition in choice tasks. Second, we
apply the parameter estimation techniques to empirical human
choice data and find statistically significant differences
between groups of subjects presented with different experimental
conditions.

The remainder of the paper is structured as follows. Section
II defines the softmax decision model. Section III defines
the parameter estimation problem for the softmax model and
reviews convergence results from the literature. Section IV
summarizes conditions under which the maximum likelihood
estimator converges. Section V provides a numerical example
of the estimator. Section VI linearizes the stochastic UCL
model about a nominal parameter $\theta$ to yield a softmax decision
model with a linear objective function, and applies the
estimator to simulated data. Section VII applies the linearization
procedure to fit the stochastic UCL model to human subject
data. Section VIII concludes.

II. The Softmax Decision Model 

In this section, we define our notation and the specific
softmax decision model for which we derive estimator
convergence bounds. We also provide several examples of this
model that appear in related literature.


A. Notation 

In the spirit of [?], we set the following notation. We
assume we have $n$ observations. For each observation $k$ we have
data consisting of $d = m \cdot n_{\mathrm{obj}}$ explanatory variables and a
response, corresponding to the assignment of one of $m$ classes.
Specifically, for each observation $k \in \{1, \ldots, n\}$ we have
data $(x^k, y^k)$. For each class $i \in \{1, \ldots, m\}$, we have $n_{\mathrm{obj}}$
explanatory variables $x_i^k \in \mathbb{R}^{n_{\mathrm{obj}}}$. The vector of explanatory
variables $x^k \in \mathbb{R}^d$ is composed of the concatenation of the
$m$ vectors $x_i^k$:

$$ x^k = \big[ (x_1^k)^T \;\; \cdots \;\; (x_m^k)^T \big]^T. $$

The response variable $y^k = (y_1^k, \ldots, y_m^k)$ represents the
class assignment, where the element $y_i^k = 1$ if the observation
corresponds to class $i$ and zero otherwise.

Motivated by models of decision-making [24], we consider
the following statistical model:

$$ p_i^k(\theta) = \Pr[y_i^k = 1 \mid x^k, \theta] = \frac{\exp(\theta^T x_i^k)}{\sum_{j=1}^{m} \exp(\theta^T x_j^k)} \tag{3} $$

for $i \in \{1, \ldots, m\}$, where $\theta \in \mathbb{R}^{n_{\mathrm{obj}}}$ is a weight vector
that is the same for all classes. This is the softmax
decision-making model with linear objective function (1) introduced
above, which has been studied in other literatures under other
names. In the natural language processing literature, (3) is
known as the conditional log-likelihood model, while in the
econometrics literature, it is known as the conditional logit
model [?].


B. Example softmax decision models 

In this section, we provide several concrete examples of
the softmax decision model (3). The goal is to make the
connection between this functional form and others that appear
in the literature.

Example 1 (Softmax with unknown temperature). A standard
decision model in reinforcement learning [?] is the so-called
softmax action selection rule, which selects an option $i$ with
probability

$$ \Pr[i] = \frac{\exp(V_i / \tau)}{\sum_{j=1}^{m} \exp(V_j / \tau)}, $$

where $V_i$ is the value associated with option $i$ and $\tau$ is
a positive parameter known as the temperature. This rule
selects options stochastically, preferentially selecting those
with higher values. The degree of stochasticity is controlled
by the temperature $\tau$. In the limit $\tau \to 0^+$, the rule reduces to
the standard maximum and deterministically selects the option
with the highest value of $V_i$. In the limit $\tau \to +\infty$, all options
are equally probable and the rule selects options according to
a uniform distribution.

This model is in the form of (3) with $n_{\mathrm{obj}} = 1$. Specifically,
assume that the temperature $\tau$ is constant but unknown, and
the values $V_i$ are known. Then the two models are identical if
we identify

$$ \theta = 1/\tau, \quad x_i = V_i. $$

In the reinforcement learning literature, the quantity $1/\tau$ is
sometimes known as the inverse temperature and referred to by
the symbol $\beta$. Our methods allow one to estimate $\theta = 1/\tau = \beta$
from observed choice data.


Example 2 (Softmax with known cooling schedule form). A
slightly more complicated model might let the softmax
temperature $\tau$ of Example 1 follow a known functional form, called
a cooling schedule, that depends on an unknown parameter.
For example, in simulated annealing, Mitra et al. [?] showed
that good cooling schedules follow a logarithmic functional
form:

$$ \tau(t) = \frac{\nu}{\log t}, $$

where $t$ is the decision time and $\nu > 0$ is a parameter.

If $\nu$ is constant but unknown, this model can be represented
in the form of (3) with $n_{\mathrm{obj}} = 1$ if we identify

$$ \theta = 1/\nu, \quad x_i = V_i \log t. $$


Example 3 (Softmax Q-learning with unknown temperature
and learning rate). According to a simple Q-learning model
[?], for each choice time $t$ the agent assigns an expected value
$V_i^t$ to each option $i$. The values are initialized to 0 at $t = 1$
and then for each subsequent time, the agent picks option $i_t$,
receives reward $r_t$, and updates the value of the chosen option
$i_t$ according to

$$ V_{i_t}^{t+1} = V_{i_t}^{t} + \alpha \delta_t, $$

where $\alpha \in [0, 1]$ is a free parameter called the learning rate
and $\delta_t = r_t - V_{i_t}^{t}$ is the prediction error at time $t$.

A common model in reinforcement learning [?] has the
agent make decisions using a softmax rule on the value
function $V_i^t$, so the probability of selecting an option $i$ at time
$t$ is

$$ \Pr[i_t = i] = \frac{\exp(V_i^t / \tau)}{\sum_{j=1}^{m} \exp(V_j^t / \tau)} = \frac{\exp\big( (V_i^{t-1} + \alpha \delta_{t-1} \mathbf{1}(i = i_{t-1})) / \tau \big)}{\sum_{j=1}^{m} \exp\big( (V_j^{t-1} + \alpha \delta_{t-1} \mathbf{1}(j = i_{t-1})) / \tau \big)}, $$

where $\mathbf{1}(\cdot)$ is the indicator function, equal to 1 if its argument
is a true statement, and 0 otherwise. Similar models are used
in the analysis of fMRI data, e.g. [34]. If $V_i^{t-1}$ and $\delta_{t-1}$
are known while $\tau$ and $\alpha$ are unknown, the model is in the
form of (3) with $n_{\mathrm{obj}} = 2$ if we identify

$$ \theta = [1/\tau; \; \alpha/\tau], \quad x_i = [V_i^{t-1}; \; \delta_{t-1} \mathbf{1}(i = i_{t-1})]. $$

If only the initial value $V_i^{1} = 0$ is known, then the value
function becomes a nonlinear function of the parameters
$\alpha$, $\tau$ and the model is not of the form (3), although it may be
possible to find a transformation that puts it in such a form.
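
The identification in Example 3 can be checked numerically. The sketch below is a minimal illustration with made-up numbers: the helper `q_learning_features` and all values are hypothetical, and option indices are 0-based. It builds the explanatory variables $x_i$ and confirms that $\theta^T x_i$ equals the updated value divided by the temperature.

```python
def q_learning_features(v_prev, delta_prev, i_prev):
    """Explanatory variables x_i = [V_i^{t-1}, delta_{t-1} * 1(i == i_prev)] for each option i."""
    return [[v, delta_prev if i == i_prev else 0.0] for i, v in enumerate(v_prev)]

# Hypothetical parameter values, chosen only for illustration.
tau, alpha = 0.5, 0.3
theta = [1.0 / tau, alpha / tau]

v_prev = [0.2, 0.0, 0.5]               # values V_i^{t-1} before the update
i_prev, r_prev = 1, 1.0                # previously chosen option and its reward
delta_prev = r_prev - v_prev[i_prev]   # prediction error delta_{t-1}

x = q_learning_features(v_prev, delta_prev, i_prev)
q_linear = [theta[0] * xi[0] + theta[1] * xi[1] for xi in x]   # theta^T x_i

# Direct route: apply the Q-learning update, then divide by the temperature.
v_now = list(v_prev)
v_now[i_prev] += alpha * delta_prev
q_direct = [v / tau for v in v_now]
```

Both routes give the same softmax arguments, which is why the model is linear in $\theta$ once the previous values and prediction error are treated as known data.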

In the following section we define the parameter estimation
problem for the softmax model (3). We then analyze the
problem to develop conditions under which this parameter
estimation problem can be solved with provable guarantees
about its convergence.

III. Parameter Estimation for Softmax Decision-Making Models

In this section, we define the parameter estimation problem
for softmax decision-making models using a likelihood
framework, and we review relevant results from the literature.
Key to these results is the concept of concavity, which is a
property of functions that can guarantee the uniqueness of
a maximum. When the likelihood function is concave, the
maximum likelihood estimation problem can be solved by
off-the-shelf optimization algorithms. Concavity is also central to
several results from the econometrics literature that provide
conditions under which the estimator is guaranteed to converge
asymptotically.

In the optimization literature, it is traditional to consider
minimization problems, for which convexity plays the same
role as concavity does for maximization problems: a function
$f$ is concave if the function $-f$ is convex, and maximizing $f$
is equivalent to minimizing $-f$. Following the literature, we
refer to concavity and convexity when discussing results from
econometrics and optimization, respectively. We distinguish
between two notions of concavity: a function $f : \mathbb{R}^n \to \mathbb{R}$
is weakly concave if its Hessian is negative semidefinite, and
strongly concave if its Hessian is strictly negative definite.

A. The softmax model parameter estimation problem 

In the parameter estimation problem for softmax
decision-making models, we wish to estimate the values of $\theta$ based
on the observed data $(x^k, y^k)$. A standard way to perform
parameter estimation is using the maximum likelihood method
[?]. To perform maximum likelihood (ML) estimation of $\theta$,
one maximizes the log-likelihood function $\ell(\theta)$.

Problem 1. The maximum likelihood parameter estimation
problem for the softmax decision model (3) is the optimization
problem

$$ \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \ell(\theta), \tag{4} $$

where $\ell(\theta)$ is the logarithm of the likelihood function of the
model (3), defined as

$$ \ell(\theta) = \sum_{k=1}^{n} \log \Pr[y^k \mid x^k, \theta] = \sum_{k=1}^{n} \left[ \sum_{i=1}^{m} y_i^k \theta^T x_i^k - \log \sum_{i=1}^{m} \exp(\theta^T x_i^k) \right]. \tag{5} $$

The ML estimate $\hat{\theta}_{\mathrm{ML}}$ can be interpreted as the parameter
value $\theta$ that makes the observed data most likely under the
given model.
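
The log-likelihood of Problem 1 is straightforward to evaluate. The following Python sketch is one possible implementation, not the authors' code: the function names and the tiny data set are hypothetical, each observation stores its chosen option as a 0-based index rather than a one-hot vector $y^k$, and the gradient formula used is the standard one for softmax models.

```python
import math

def log_likelihood(theta, xs, choices):
    """Log-likelihood (5). xs[k][i] is the feature vector of option i in observation k;
    choices[k] is the index of the chosen option (the i with y_i^k = 1)."""
    total = 0.0
    for x, c in zip(xs, choices):
        q = [sum(t * v for t, v in zip(theta, xi)) for xi in x]   # theta^T x_i^k
        q_max = max(q)
        log_z = q_max + math.log(sum(math.exp(v - q_max) for v in q))  # stable log-sum-exp
        total += q[c] - log_z
    return total

def gradient(theta, xs, choices):
    """Gradient of (5): chosen feature vector minus softmax-weighted mean, summed over k."""
    g = [0.0] * len(theta)
    for x, c in zip(xs, choices):
        q = [sum(t * v for t, v in zip(theta, xi)) for xi in x]
        q_max = max(q)
        w = [math.exp(v - q_max) for v in q]
        z = sum(w)
        for a in range(len(theta)):
            mean_a = sum(wi * xi[a] for wi, xi in zip(w, x)) / z
            g[a] += x[c][a] - mean_a
    return g

# Tiny hypothetical data set: n = 2 observations, m = 3 options, n_obj = 2.
xs = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
      [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]]
choices = [0, 1]
theta = [0.5, -0.25]
ll = log_likelihood(theta, xs, choices)
grad = gradient(theta, xs, choices)
```

Any gradient-based optimizer can then be applied to `log_likelihood`, with `gradient` supplying the search direction.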

A prior on $\theta$ can be incorporated by adopting a maximum
a posteriori (MAP) estimate,

$$ \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \left[ \ell(\theta) + \log p(\theta) \right], \tag{6} $$

with $p(\theta)$ being the prior on $\theta$. The MAP estimate penalizes
ML estimates that are considered unlikely under the prior.
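
As an illustration of how the prior acts as a penalty, suppose the prior is Gaussian, $\theta \sim \mathcal{N}(\mu, \sigma^2 I)$. This choice of prior, and the symbols $\mu$ and $\sigma$, are assumptions made here for the example and are not prescribed by the paper. The MAP problem then becomes a quadratically penalized ML problem:

```latex
\log p(\theta) = -\frac{1}{2\sigma^2}\,\|\theta - \mu\|^2 + \text{const},
\qquad
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\Big[\ell(\theta) - \frac{1}{2\sigma^2}\,\|\theta - \mu\|^2\Big].
```

Because the quadratic penalty is strictly concave, adding it to a concave log-likelihood yields a strictly concave objective, so under this assumed prior the MAP problem has a unique maximizer.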

B. Asymptotic behavior of the ML estimator 

The ML estimator $\hat{\theta}_{\mathrm{ML}}$ solves the estimation problem in the
frequentist framework, which posits that there is a true value
$\theta_0$ of the parameters that we attempt to recover from analyzing
the given data. In this framework, natural questions to be asked
are 1) does $\hat{\theta}_{\mathrm{ML}} \to \theta_0$ as the number of observations $n$ grows,
and 2) how dispersed is the difference $\hat{\theta}_{\mathrm{ML}} - \theta_0$? These
questions have been studied in the econometrics literature,
for which [20] is a standard reference. The remainder of this
section summarizes the relevant results from [20]. The answers
to these two questions depend on two properties of the model,
identification and concavity, defined as follows.

Definition 1 (Identification). A statistical model with
likelihood function $\ell : \mathbb{R}^d \to \mathbb{R}$ and observed data $x$ is said to be
identified if, for all $\theta, \theta_0 \in \mathbb{R}^d$,

$$ \theta \neq \theta_0 \implies \ell(\theta_0; x) \neq \ell(\theta; x). $$

Definition 2 (Concavity). A statistical model with likelihood
function $\ell : \mathbb{R}^d \to \mathbb{R}$ is said to be concave if $\ell(\theta; x)$ is strictly
concave in $\theta$.

If a model with likelihood function $\ell$ is identified and
concave (see [20, Theorem 2.7] for details), the answer to
question 1) is yes. These two properties imply that the true
value $\theta_0$ of the parameter is the unique maximum of the
expected value of the log-likelihood $\ell(\theta)$.

Concavity and identification can depend on both the
functional form of $\ell(\theta)$ and the observed data $x$. As an example of
how the identification property may fail due to data, consider
the model (3) with $x_i$ being the zero vector for each $i$. In this
case, $\Pr[y_i = 1 \mid x, \theta] = 1/m$ for each $i$ independent of $\theta$ and
the estimation procedure will be unable to distinguish among
the possible parameter values. In the following section, we
derive conditions on the data that ensure identification. These
conditions also ensure that $\ell(\theta)$ is strictly concave and provide
guidelines for the design of experiments for estimating $\theta$.
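
The dependence of concavity on the data can be made explicit by differentiating the log-likelihood twice. The expressions below are a standard calculation for softmax models, written in the notation of the model above; the shorthand $\bar{x}^k = \sum_{i} p_i^k(\theta)\, x_i^k$ is introduced here for convenience.

```latex
\nabla \ell(\theta) = \sum_{k=1}^{n} \sum_{i=1}^{m} \big(y_i^k - p_i^k(\theta)\big)\, x_i^k,
\qquad
H(\theta) = -\sum_{k=1}^{n} \sum_{i=1}^{m} p_i^k(\theta)\,
            \big(x_i^k - \bar{x}^k\big)\big(x_i^k - \bar{x}^k\big)^T .
```

Each term of $H(\theta)$ is minus a covariance matrix, so the Hessian is negative semidefinite for any data. When every $x_i^k$ is the zero vector the Hessian vanishes identically, which matches the flat likelihood in the example above.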

The answer to question 2) is that, under mild regularity
conditions, the distribution of $\hat{\theta}_{\mathrm{ML}}$ approaches a normal
distribution as the number of samples $n$ grows. In particular,
the following limit holds:

$$ \hat{\theta}_{\mathrm{ML}} \xrightarrow{d} \mathcal{N}(\theta_0, J^{-1}/n), \tag{7} $$

where $\xrightarrow{d}$ signifies a limit in distribution as $n \to \infty$ and
$J = -E[H(\theta_0)]$ is the negative of the expected value of
the Hessian of $\ell(\theta)$ with respect to $\theta$. See [?, Chapter 9] for
more details about the concept of a limit in distribution and
see [20, Theorem 3.3] for full details of the conditions under
which (7) holds. In practice one uses $\hat{J} = -H(\hat{\theta}_{\mathrm{ML}})/n$ as an
estimate of $J$. This permits construction of standard frequentist
analysis tools, such as confidence intervals for the parameter
estimates and hypothesis tests. The estimate $\hat{\theta}_{\mathrm{ML}}$ is asymptotically
efficient in the sense that it attains the Cramér-Rao lower bound [?] on
the variance of unbiased estimators, so asymptotically no other
unbiased estimator can have lower variance than $\hat{\theta}_{\mathrm{ML}}$.
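
As a worked illustration of how the asymptotic distribution is used in practice, note that with $\hat{J} = -H(\hat{\theta}_{\mathrm{ML}})/n$ the factor of $n$ cancels in the estimated covariance. The 1.96 multiplier below is the usual standard-normal quantile for an approximate 95% interval; this interval construction is a textbook one added here, not a formula quoted from the paper.

```latex
\widehat{\operatorname{Cov}}\big(\hat{\theta}_{\mathrm{ML}}\big)
  = \frac{\hat{J}^{-1}}{n}
  = \big(-H(\hat{\theta}_{\mathrm{ML}})\big)^{-1},
\qquad
\big(\hat{\theta}_{\mathrm{ML}}\big)_i \;\pm\; 1.96\,
  \sqrt{\Big[\big(-H(\hat{\theta}_{\mathrm{ML}})\big)^{-1}\Big]_{ii}} .
```

Here $H$ denotes the Hessian of the full log-likelihood, i.e. the sum over all $n$ observations.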

IV. Analysis of the Maximum Likelihood Estimator for Softmax Decision Models with Linear Objective Functions


In this section we present conditions under which the model
(3) is identified and concave. These conditions imply that the
ML estimator $\hat{\theta}_{\mathrm{ML}}$ converges and that the ML optimization
problem (4) is convex. The concavity of the model is an
accepted fact in the natural language processing literature [?];
we summarize the result in Theorem 1.

A. Asymptotic and finite-sample behavior

Recall from Section III-B that two properties that guarantee
asymptotic convergence of the ML estimator are identification
and concavity. Whether or not the model (3) satisfies these
properties can be a function of the data $x^k$, $k \in \{1, \ldots, n\}$.
Recall our example where $x_i^k = 0$ for each $i$ and $k$. In this
case the probability $\Pr[y_i^k = 1 \mid x^k, \theta] = 1/m$ for each $i$ and $k$
for all values of $\theta$ and the likelihood function $\ell(\theta)$ is flat, so
neither identification nor concavity is satisfied.

However, a sufficient condition for identification is as
follows. Define the $n_{\mathrm{obj}} \times m$ matrix $X^k$ by transforming the
explanatory variable $x^k$ of a single observation $k$:

$$ X^k = [\, x_1^k \;\; x_2^k \;\; \cdots \;\; x_{m-1}^k \;\; 0 \,]. \tag{8} $$

Note that $X^k (X^k)^T = \sum_i x_i^k (x_i^k)^T$. Considering $X^k$ as a
random variable, the following lemma ensures identification.

Lemma 1. Let $x$ be the explanatory variable for an arbitrary
observation and let $X$ be the transformation of $x$ defined in
(8). If the second-moment matrix $E[X X^T]$ exists and is
positive definite, then the model (3) is identified.

Proof: The probability of choosing an option $i$ under the
model (3) is a monotonic function of the objective value $Q_i$,
so it suffices to show that the data provides a one-to-one
mapping between the parameter vector $\theta$ and the objective
values $Q_1, \ldots, Q_m$.

Let $\theta, \theta' \in \mathbb{R}^{n_{\mathrm{obj}}}$ and define the vectors of objective
function values $Q = \theta^T X$ and $Q' = \theta'^T X$. Define
$\Delta Q = Q - Q' = (\theta - \theta')^T X \in \mathbb{R}^m$. The magnitude of
$\Delta Q$ satisfies $E[\|\Delta Q\|^2] = (\theta - \theta')^T E[X X^T] (\theta - \theta')$.
Then by the assumption that $E[X X^T]$ is positive definite,
$E[\|\Delta Q\|^2] = 0$ implies $(\theta - \theta') = 0$, so $\theta = \theta'$ and
$Q = Q'$. Therefore the mapping between the parameters $\theta$
and the objective values $Q$ is one-to-one, which
implies that $\ell(\theta \mid x, y) \neq \ell(\theta' \mid x, y)$ for $\theta \neq \theta'$ and the model
is identified. ■

The condition of Lemma 1 is given in terms of an
expectation, but in practice one has a given sample of data. In this
case the expectation can be replaced by the sample average.
Specifically, define $X^k$ for each observation $k$ following (8).
Then $E[X X^T]$ is estimated by

$$ E[X X^T] \approx \frac{1}{n} \sum_{k=1}^{n} X^k (X^k)^T. $$

If this sample average is positive definite, then the model is
identified. For the sample average to be positive definite it
must be full rank $= n_{\mathrm{obj}}$, and each observation $k$ can add at
most $m$ to the rank. Therefore, the following inequality must
be satisfied for the model to be identified:

$$ m \cdot n \geq n_{\mathrm{obj}}. $$

This gives a lower bound $n \geq \lceil n_{\mathrm{obj}} / m \rceil$ on the minimum
number of observations required for identification. For most
applications, the number of options $m$ will be larger than the
number of parameters $n_{\mathrm{obj}}$, so the lower bound is trivial.
However, for cases with a large number of parameters the bound
can be useful for experimental design.

The following theorem summarizes the conditions under
which the ML estimator (4) converges.

Theorem 1 (Convergence of the ML estimator). Let $X^k$ be
defined as in (8). If the second-moment matrix

$$ \frac{1}{n} \sum_{k=1}^{n} X^k (X^k)^T $$

exists and is positive definite, then

1) The ML optimization problem (4) is convex.

2) The ML estimator $\hat{\theta}_{\mathrm{ML}}$ for the model (3) is
asymptotically approximately distributed as

$$ \hat{\theta}_{\mathrm{ML}} \sim \mathcal{N}(\theta_0, \hat{J}^{-1}/n), \tag{9} $$

where $\hat{J} = -H(\hat{\theta}_{\mathrm{ML}})/n$ is the negative of the empirical mean Hessian
of the likelihood function evaluated at the estimated
parameter value.

Proof: See [?] and [?]. ■

Theorem 1 proves convergence of the parameter estimate
$\hat{\theta}_{\mathrm{ML}}$ and provides its asymptotic distribution. This distribution
can be used to formulate frequentist confidence intervals for
the parameter estimate $\hat{\theta}_{\mathrm{ML}}$. Furthermore, the theorem proves
that the optimization problem (4) is convex, which allows
us to solve it using off-the-shelf optimization algorithms. In
the following, we use the phrase the estimator to refer to
the procedure of using an off-the-shelf convex optimization
algorithm to solve the maximum likelihood problem (4). We
use the phrase the estimate to refer to the solution $\hat{\theta}$ of (4)
thus obtained.
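
The positive-definiteness condition on the sample second-moment matrix can be checked numerically before running the estimator. The sketch below is one way to do so under stated assumptions: each matrix is supplied as a list of $n_{\mathrm{obj}}$ rows of length $m$, the function names are hypothetical, and positive definiteness is tested with a plain Cholesky factorization using a small tolerance `eps` that is a choice made here.

```python
import math

def second_moment(x_mats):
    """Sample average (1/n) * sum_k X^k (X^k)^T for a list of n_obj-by-m matrices."""
    n_obj = len(x_mats[0])
    s = [[0.0] * n_obj for _ in range(n_obj)]
    for x in x_mats:
        for a in range(n_obj):
            for b in range(n_obj):
                s[a][b] += sum(u * v for u, v in zip(x[a], x[b]))
    n = len(x_mats)
    return [[v / n for v in row] for row in s]

def is_positive_definite(s, eps=1e-12):
    """Attempt a Cholesky factorization; it succeeds only for positive definite matrices."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            r = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if r <= eps:          # non-positive pivot: not positive definite
                    return False
                low[i][j] = math.sqrt(r)
            else:
                low[i][j] = r / low[j][j]
    return True

# Hypothetical data with n_obj = 2 and m = 3.
informative = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
degenerate = [[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]]   # second row is twice the first
ok = is_positive_definite(second_moment(informative))
bad = is_positive_definite(second_moment(degenerate))
```

In the degenerate case the two explanatory variables are perfectly collinear, so the data cannot separate their effects and the check fails.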






V. Numerical Examples

In this section we present several numerical examples to
demonstrate the theory developed in the previous sections for
solving the parameter estimation problem (4).

A. Scalar parameter 

First, we consider model (3) with $m = 10$ options and
a scalar parameter $\theta = \theta_0$ that we wish to estimate.
This could correspond to a decision-maker choosing among
ten options using a softmax model with unknown constant
inverse temperature $\theta = 1/\tau$, as in Example 1. Alternatively,
it could correspond to a temperature varying with observation
number $k = 1, \ldots, n$ according to a known function with a
single unknown parameter, e.g., $\tau_k = \nu / \log k$ with $\theta = 1/\nu$, as in
Example 2. In this case the $\log k$ term can be absorbed into
the explanatory variables and we proceed as in the first case.

Figure 2 shows results of applying the estimator to
simulated data. For every $k$, when an observation was taken and a
decision made, the model was simulated 100 times. For each of
the 100 simulations, the estimator was applied to estimate the
parameter $\theta$ based on the first $k$ observations. Running 100
simulations made it possible to examine convergence of the
estimate in distribution. Figure 2 illustrates how the estimates
converge in distribution to the normal distribution (9) as the
number of observations $n$ increases. For the simulations, the
explanatory variables $x_i^k$ were drawn from a standard Gaussian
distribution $\mathcal{N}(0, 1)$ (mean zero and unit variance), and the
response variables $y^k$ were drawn according to probability
distribution (3) conditional on $x^k$ and $\theta_0 = 4$. The estimates
were computed by solving the optimization problem (4) using
a BFGS quasi-Newton algorithm [?], [?], [?], [?] (Matlab
function fminunc). Theorem 1 guarantees that the
optimization problem is convex, so the algorithm will converge.
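
A scaled-down version of this experiment can be sketched in plain Python. The code below is not the authors' Matlab implementation: it swaps BFGS for bisection on the derivative of the log-likelihood (valid here because that derivative is decreasing in a scalar $\theta$), it assumes the estimate is positive and finite, and it runs a single simulation with an arbitrary seed and $n = 1000$ rather than an ensemble of 100. All function names are hypothetical.

```python
import math
import random

def softmax(q):
    q_max = max(q)                       # shift for numerical stability
    w = [math.exp(v - q_max) for v in q]
    s = sum(w)
    return [v / s for v in w]

def score_and_hessian(theta, xs, choices):
    """First and second derivatives of the log-likelihood for scalar theta."""
    g, h = 0.0, 0.0
    for x, c in zip(xs, choices):
        p = softmax([theta * v for v in x])
        mean = sum(pi * v for pi, v in zip(p, x))
        var = sum(pi * (v - mean) ** 2 for pi, v in zip(p, x))
        g += x[c] - mean                 # chosen value minus softmax-weighted mean
        h -= var                         # minus the softmax-weighted variance
    return g, h

def fit_scalar(xs, choices, hi=1.0, tol=1e-8):
    """ML estimate by bisection on the score, assuming a positive, finite estimate."""
    lo = 0.0
    while score_and_hessian(hi, xs, choices)[0] > 0 and hi < 1e3:
        hi *= 2.0                        # expand the bracket until the score changes sign
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_and_hessian(mid, xs, choices)[0] > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
m, n, theta0 = 10, 1000, 4.0
xs = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
choices = [rng.choices(range(m), weights=softmax([theta0 * v for v in x]))[0] for x in xs]

theta_hat = fit_scalar(xs, choices)
_, h = score_and_hessian(theta_hat, xs, choices)
half_width = 1.96 * math.sqrt(-1.0 / h)  # asymptotic variance J_hat^{-1}/n equals -1/H
print(f"estimate {theta_hat:.3f} +/- {half_width:.3f}")
```

Repeating the simulation with different seeds and plotting the spread of `theta_hat` against $n$ would give a rough analogue of the ensemble experiment described above.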

The convergence behavior can be seen in Figure 2 by
observing the mean parameter estimate $\hat{\theta}_{\mathrm{ML}}$ as well as its
confidence intervals. The mean parameter estimate $\hat{\theta}_{\mathrm{ML}}$,
represented by the solid black line, converges to the true
parameter value $\theta_0 = 4$, represented by the horizontal dashed line.
However, Theorem 1 guarantees convergence in distribution,
which is a stronger result. To illustrate this behavior we plot
95% confidence intervals for both the empirical distribution of
estimates $\hat{\theta}_{\mathrm{ML}}$ and the asymptotic distribution (9), computed
from the ensemble of 100 parameter estimates. For values
of $n$ greater than 100, the two intervals overlap closely,
showing that the distribution of estimates has converged.
Importantly, this shows that statistical hypothesis tests based
on the asymptotic distribution (9) will be accurate.

For small amounts of data, i.e., $n < 50$, the mean parameter
estimate is biased above the true value. The bias is due to
an insufficient amount of data being used in the estimation
procedure, and the direction of the bias can be explained as
follows. Larger values of the parameter $\theta$ correspond to more
deterministic choice behavior. When $\theta_0 > 0$, for any given
choice, the model is more likely to pick the option with a
larger objective value, resulting in a parameter estimate that
is biased upwards. This bias can be seen in Figure 3 as well,
which treats a case with a vector parameter.



Fig. 2. Scalar parameter estimation example. Illustration of the convergence
of parameter estimates to the asymptotic normal distribution as the number
of observations $n$ grows. The dashed lines show the true value of the scalar
parameter $\theta_0 = 4$ and the accompanying 95% confidence intervals implied by
the asymptotic normal distribution (9). For each value of $n$, an ensemble of
100 parameter estimates was formed by repeatedly simulating the data $y$ while
holding the explanatory variables $x$ fixed, and using the estimator to compute
the value of the parameter. The solid black line shows the mean parameter
estimate and the shaded region the empirical 95% confidence interval.


B. Vector parameter 

Next, we consider the model (3) with $m = 100$ options
and a vector parameter $\theta = \theta_0$ with $n_{\mathrm{obj}} = 3$ elements that
we wish to estimate. Figure 3 shows results of applying the
estimator to simulated data in this vector parameter estimation
example. As in the scalar parameter estimation case above,
the model was simulated 100 times for every $k = 1, \ldots, n$.
Figure 3 shows how the estimate converges to the true value
$\theta_0$ as the total number of observations $n$ increases. The
explanatory variables $x_i^k$ were drawn according to independent
standard Gaussian distributions, and the response variables
drawn according to the model (3) conditional on $x^k$ and true
vector parameter value $\theta_0 = [1, 2, 3]^T$. The estimates were
computed as in the scalar case.

The convergence behavior can be seen in Figure 3 by
observing the mean parameter estimate $\hat{\theta}_{\mathrm{ML}}$ as well as
its confidence intervals. For each of the three parameters
$\theta_i$, $i = 1, 2, 3$, the corresponding mean parameter estimate
$(\hat{\theta}_{\mathrm{ML}})_i$, represented as a solid line, converges to the true
parameter value $(\theta_0)_i$, represented by a horizontal dashed line.
The shaded regions represent the empirical 95% confidence
interval around the corresponding mean value, computed from
the ensemble of 100 parameter estimates. For clarity, we omit
the confidence intervals implied by the asymptotic normal
distribution (9) from the figure, but the behavior is similar
to that shown in Figure 2.

There is an upwards bias in the parameter estimates for
small numbers of observations $n$, as in Figure 2. The width of
the confidence intervals for the three parameters scales roughly
with their true value $(\theta_0)_i$. This behavior can be seen in the
figures in the next section as well.

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Fig. 3. Vector parameter estimation example. Illustration of the convergence 
of parameter estimate to the asymptotic normal distribution as the number 
of observations n grows. The dashed lines show the time value of each element 
of the vector parameter 6q = [1, 2, 3]^. For each value of n, an ensemble 
of 100 parameter estimates was formed by repeatedly simulating the data y 
while holding the explanatory variables x fixed, and using the estimator to 
compute the value of the parameter. The solid lines show the mean parameter 
estimate and the shaded regions the empirical 95% confidence interval. 


VI. Application to nonlinear objective functions using linearization

The development up to this point for addressing the parameter
estimation problem has assumed that the objective
function takes a linear form. However, many relevant
objective functions are nonlinear functions of the unknown
parameter $\theta$. One approach is to linearize the nonlinear
objective function about a nominal parameter value, and then
apply the estimator to the linearized objective function. We
apply this approach to the nonlinear objective function from
the stochastic UCL algorithm [24], an algorithm for human
decision-making in multi-armed bandit tasks in a Bayesian
setting, and show how its parameters can be estimated.


A. The multi-armed bandit problem 

The multi-armed bandit problem, introduced by Robbins
[25], is a sequential decision problem which consists of a set of
N options (each option is also called an arm, in analogy with
the lever of a slot machine). Each option $i \in \{1, \ldots, N\}$
has an associated probability distribution $p_i$ with mean $m_i$,
unknown to the agent solving the problem. At each sequential
decision time $t \in \{1, \ldots, T\}$ the agent picks an arm $i_t$ and
receives a stochastic reward $r_t \sim p_{i_t}(r)$ drawn from the
probability distribution associated with that arm. This is a
special case of the notation introduced in Section II-A with
$m = N$ options indexed by $i$ and $n = T$ decisions indexed
by $t$. The agent’s objective is to maximize the expected value
of the cumulative rewards received from the T decisions:

$$\max_{\{i_t\}} J, \qquad J = \mathbb{E}\left[\sum_{t=1}^{T} r_t\right] = \sum_{t=1}^{T} m_{i_t}.$$


Each choice of $i_t$ is made conditional on the information
available to the agent at time t. If the mean rewards $m_i$
were known to the agent, the optimal policy would be trivial:
pick arm $i_t \in \arg\max_i m_i$ for each t. However, since the
mean rewards are unknown, the agent must simultaneously
select arms where the reward value is uncertain to gain
information about the rewards and preferentially select arms
with high rewards to accumulate reward. The tension between
selecting arms with uncertain (but possibly high) rewards and
selecting arms that appear to have high rewards based on
current information is known as the explore-exploit tradeoff.
This tradeoff is common to a variety of problems in machine
learning and adaptive control.

The multi-armed bandit problem is the subject of active
research in machine learning as well as in neuroscience. In
[2], we showed that a significant fraction of human subjects
exhibited excellent performance in solving a multi-armed
bandit problem, even outperforming algorithms known to have
optimal performance in some cases. We attributed this good
performance to the human subjects’ having good priors on the
structure of the rewards $m_i$, and we designed the stochastic
UCL algorithm as a model of human performance to capture
this dependence on priors. Estimating the parameters of this
model from observations of a human solving the multi-armed
bandit task would allow a machine to learn the human’s belief
priors. This could in turn facilitate the design of a human-machine
system that could achieve better performance than
either the human or the machine could on its own.

B. The stochastic UCL algorithm 

The stochastic UCL algorithm [24] is designed to solve
multi-armed bandit problems with Gaussian rewards, i.e.,
where the reward distribution $p_i(r) = \mathcal{N}(m_i, \sigma_s^2)$ is Gaussian
with unknown mean $m_i$ and known variance $\sigma_s^2$. The algorithm
consists of two parts: Bayesian inference that maintains the
agent’s belief state and a softmax decision model that uses
an objective function Q that depends on the belief state.
Both the inference and the decision parts introduce nonlinear
dependencies on the parameters of the algorithm.

As a model of human behavior, the stochastic UCL algorithm
assumes that the agent’s prior distribution of $m$ (i.e.,
the agent’s initial beliefs about the mean reward values $m$
and their covariance) is multivariate Gaussian with mean $\boldsymbol{\mu}_0$
and covariance $\Sigma_0$:

$$m \sim \mathcal{N}(\boldsymbol{\mu}_0, \Sigma_0),$$

where $\boldsymbol{\mu}_0 \in \mathbb{R}^N$ and $\Sigma_0 \in \mathbb{R}^{N \times N}$ is a positive-definite matrix.

In [24] we use a minimal set of three parameters to specify
$(\boldsymbol{\mu}_0, \Sigma_0)$. For the mean we use a uniform prior $\boldsymbol{\mu}_0 = \mu_0 \mathbf{1}_N$,
where $\mu_0 \in \mathbb{R}$ is a single parameter that encodes the agent’s
belief about the mean value of the rewards and $\mathbf{1}_N$ is the vector
with each element equal to 1. For the problems considered
in [24], the arms are spatially embedded with each arm at a
different location in space (see Figure 7 in the next section).
It is reasonable to assume that arms that are spatially close
will have similar mean rewards. Therefore, for the covariance
$\Sigma_0$ we set $\Sigma_0 = \sigma_0^2 \Sigma$, where $\Sigma$ represents a prior that is
exponential in distance, i.e., each element has the form

$$\Sigma_{ij} = \exp(-\|z_i - z_j\|/\lambda), \tag{10}$$

where $z_i$ is the location of arm i and $\lambda > 0$ is the correlation
length scale. The parameter $\sigma_0 > 0$ can be interpreted as
a confidence parameter, with $\sigma_0 = 0$ representing absolute
confidence in the beliefs about the mean $\boldsymbol{\mu}_0$, and $\sigma_0 = +\infty$
representing complete lack of confidence.

With this prior, the posterior distribution is also Gaussian,
so the Bayesian optimal inference algorithm is linear and can
be written down as follows. At each time t, the agent selects
option $i_t$ and receives a reward $r_t$. Let $r^t$ be the $t \times 1$ vector
composed of the $r_\tau$, $\tau = 1, \ldots, t$. Let $n_i^t$ be the number of times the agent
has selected option i up to time t, let $\bar{m}_i^t$ be the empirical
mean reward observed for option i, and let $n^t$ and $\bar{m}^t$ be the
vectors composed of the $n_i^t$ and $\bar{m}_i^t$, respectively. For each
time t, define the precision matrix $\Lambda_t = \Sigma_t^{-1}$. Then the belief
state at time t is (see Theorem 10.3 of the cited reference)

$$\Lambda_t = \frac{\operatorname{diag}(n^t)}{\sigma_s^2} + \Sigma_0^{-1}, \qquad \Sigma_t = \Lambda_t^{-1}, \tag{11}$$

$$\boldsymbol{\mu}_t = \boldsymbol{\mu}_0 + \Sigma_0 H_t^T \left(H_t \Sigma_0 H_t^T + \sigma_s^2 I_t\right)^{-1} \left(r^t - H_t \boldsymbol{\mu}_0\right), \tag{12}$$

where $H_t$ is the $t \times N$ observation matrix with $H_t(\tau, j) = 1$ if
$i_\tau = j$ and zero otherwise, and $I_t$ is the t-dimensional identity
matrix.
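As a numerical sanity check on the belief update, the sketch below computes the posterior mean in two ways: with the batch formula (12), and with the equivalent information form $\boldsymbol{\mu}_t = \Sigma_t(\Sigma_0^{-1}\boldsymbol{\mu}_0 + H_t^T r^t/\sigma_s^2)$ built from the precision matrix in (11). The arm locations, prior values, and reward data are made up for illustration, and the function names are not taken from the original.

```python
import numpy as np

def exponential_prior_cov(z, sigma0_sq, lam):
    """Sigma_0 = sigma0^2 * exp(-||z_i - z_j|| / lam) for arm locations z (N, dim)."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return sigma0_sq * np.exp(-dist / lam)

def posterior_batch(mu0, Sigma0, arms, rewards, sigma_s_sq):
    """Posterior mean from the batch formula (12)."""
    t, N = len(arms), len(mu0)
    H = np.zeros((t, N))
    H[np.arange(t), arms] = 1.0                    # H_t(tau, j) = 1 if i_tau = j
    S = H @ Sigma0 @ H.T + sigma_s_sq * np.eye(t)
    return mu0 + Sigma0 @ H.T @ np.linalg.solve(S, rewards - H @ mu0)

def posterior_information(mu0, Sigma0, arms, rewards, sigma_s_sq):
    """Posterior mean and covariance from the precision (information) form (11)."""
    N = len(mu0)
    n = np.bincount(arms, minlength=N).astype(float)              # visit counts n^t
    reward_sums = np.bincount(arms, weights=rewards, minlength=N)  # H_t^T r^t
    Lambda0 = np.linalg.inv(Sigma0)
    Sigma_t = np.linalg.inv(np.diag(n) / sigma_s_sq + Lambda0)
    mu_t = Sigma_t @ (Lambda0 @ mu0 + reward_sums / sigma_s_sq)
    return mu_t, Sigma_t

# Hypothetical example: nine arms on a 3 x 3 grid, five observed choices.
z = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
Sigma0 = exponential_prior_cov(z, sigma0_sq=4.0, lam=2.0)
mu0 = np.full(9, 30.0)
arms = np.array([0, 4, 4, 8, 2])
rewards = np.array([28.0, 35.5, 33.0, 41.2, 30.1])

mu_batch = posterior_batch(mu0, Sigma0, arms, rewards, sigma_s_sq=10.0)
mu_info, Sigma_t = posterior_information(mu0, Sigma0, arms, rewards, sigma_s_sq=10.0)
print(np.max(np.abs(mu_batch - mu_info)))
```

The information form only needs the visit counts and per-arm reward sums, so it avoids building the $t \times N$ matrix $H_t$ as the number of decisions grows.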

Based on the belief state $(\boldsymbol{\mu}_t, \Sigma_t)$, the stochastic UCL
algorithm chooses arm $i_t$ with probability

$$\Pr[i_t = i \mid Q_t, \upsilon_t] = \frac{\exp(Q_i^t/\upsilon_t)}{\sum_{j=1}^{N} \exp(Q_j^t/\upsilon_t)}, \tag{13}$$

where $Q_i^t$ is the heuristic function value for arm i at time t and
$\upsilon_t$ is the temperature corresponding to the cooling schedule
at time t. The cooling schedule is assumed to take the form
$\upsilon_t = \nu/\log t$, with $\nu$ a constant, so the probabilities (13) become

$$\Pr[i_t = i \mid Q_t, \nu] = \frac{\exp((Q_i^t \log t)/\nu)}{\sum_{j=1}^{N} \exp((Q_j^t \log t)/\nu)}. \tag{14}$$

The heuristic function is

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}(1 - \alpha_t), \tag{15}$$

where $\mu_i^t = (\boldsymbol{\mu}_t)_i$ is the posterior mean reward of arm i at
time t and $\sigma_i^t = \sqrt{(\Sigma_t)_{ii}}$ its associated standard deviation. The
quantity $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function
of the standard normal distribution and $\alpha_t = 1/(\sqrt{2\pi e}\, t)$ is a
decreasing function of time.
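The decision rule can be illustrated with a short sketch that evaluates the heuristic (15) and the resulting choice probabilities under the cooling schedule $\upsilon_t = \nu/\log t$. The belief-state values below are invented for the example, and `ucl_choice_probabilities` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

import numpy as np

def ucl_choice_probabilities(mu_t, sigma_t, t, nu):
    """Softmax choice probabilities for posterior means mu_t and std devs sigma_t."""
    alpha_t = 1.0 / (math.sqrt(2.0 * math.pi * math.e) * t)
    # Heuristic (15): posterior mean plus an upper-credible-limit bonus.
    Q = mu_t + sigma_t * NormalDist().inv_cdf(1.0 - alpha_t)
    u = Q * math.log(t) / nu                 # objective (Q log t) / nu
    p = np.exp(u - u.max())                  # subtract the max for stability
    return p / p.sum()

mu_t = np.array([30.0, 32.0, 31.0])          # illustrative posterior means
sigma_t = np.array([1.0, 1.0, 4.0])          # arm 2 is the most uncertain

p_first = ucl_choice_probabilities(mu_t, sigma_t, t=1, nu=4.0)
p_later = ucl_choice_probabilities(mu_t, sigma_t, t=50, nu=4.0)
print(p_first, p_later)
```

At $t = 1$ the factor $\log t$ vanishes, so every arm is equally likely; as $t$ grows the temperature falls and the probabilities concentrate on the arm with the largest heuristic value, which here is the uncertain arm because of its larger credible-limit bonus.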

This is a softmax decision model with unknown parameters
$(\mu_0, \sigma_0, \lambda, \nu)$, but it is not yet in the required linear form, since the
quantity $(Q_i^t \log t)/\nu$ is a nonlinear function of the parameters.
However, we can locally approximate $(Q_i^t \log t)/\nu$ with a
linear function by linearizing about a nominal value of the
prior. By estimating the parameter values of the linearized
model, we can recover the parameters of the original nonlinear
model (14) near the nominal prior.


C. Linearization 

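Let $\delta_0^2 = \sigma_s^2/\sigma_0^2$ be the relative precision of a reward
measurement compared to the certainty of the prior. Fix a
nominal prior with parameter values $(\bar\mu_0, \bar\delta_0^2, \lambda)$ and consider
small deviations $\Delta_\mu$ and $\Delta_\delta$ about $\bar\mu_0$ and $\bar\delta_0^2$, respectively:

$$\mu_0 = \bar\mu_0 + \Delta_\mu, \qquad \delta_0^2 = \bar\delta_0^2 + \Delta_\delta.$$

In the case that the true value of $\lambda$ is unknown, this method is
easily generalized to include deviations in $\lambda$, but for simplicity
of exposition we consider it fixed. Recall that the covariance
prior is $\Sigma_0 = \sigma_0^2 \Sigma$, where $\Sigma$ is defined by (10), and its inverse
is denoted by $\Lambda = \Sigma^{-1}$.

In terms of the nominal value $\bar\delta_0^2$, (11) becomes

$$\Lambda_t = \frac{1}{\sigma_s^2}\left(\operatorname{diag}(n^t) + \bar\delta_0^2 \Lambda + \Delta_\delta \Lambda\right).$$

Therefore, to first order in $\Delta_\delta$, $\Sigma_t = \Lambda_t^{-1}$ is given by

$$\Sigma_t = \sigma_s^2\left(A_t^{-1} - \Delta_\delta A_t^{-1} B A_t^{-1}\right) + O(\Delta_\delta^2), \tag{16}$$

where $A_t = \bar\delta_0^2 \Lambda + \operatorname{diag}(n^t)$ and $B = \Lambda = \Sigma^{-1}$. Expanding
the square root in the following, we get

$$\sigma_i^t = \sqrt{(\Sigma_t)_{ii}} = \sqrt{c_i^t} - \frac{\Delta_\delta\, d_i^t}{2\sqrt{c_i^t}} + O(\Delta_\delta^2), \tag{17}$$

where $c_i^t$ is the $i$-th element on the diagonal of $C_t = \sigma_s^2 A_t^{-1}$
and $d_i^t$ is the $i$-th element on the diagonal of $D_t = \sigma_s^2 A_t^{-1} B A_t^{-1}$.
The standard deviation $\sigma_i^t$ must be nonnegative,
which implies an upper bound on $\Delta_\delta$. Similarly, $\delta_0^2$ must be
nonnegative, which implies a lower bound on $\Delta_\delta$, which is
already assumed to be small. These implied bounds on $\Delta_\delta$,
together with the requirement that $\Delta_\delta$ be small with
respect to $\bar\delta_0^2$, give a bound on the values of $\Delta_\delta$ for which
the linearization is valid.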
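For reference, one way to make these bounds explicit is sketched below. It is a reconstruction under two assumptions: that nonnegativity is imposed on the first-order expression for $\sigma_i^t$ in (17), and that $d_i^t > 0$, which holds when $B$ is positive definite. The bounds may equally be stated in a related form; for example, requiring the argument of the square root, $c_i^t - \Delta_\delta d_i^t$, to be nonnegative gives the tighter upper bound $c_i^t/d_i^t$.

```latex
\sqrt{c_i^t} - \frac{\Delta_\delta\, d_i^t}{2\sqrt{c_i^t}} \ge 0
\;\Longrightarrow\;
\Delta_\delta \le \frac{2 c_i^t}{d_i^t},
\qquad
\bar\delta_0^2 + \Delta_\delta \ge 0
\;\Longrightarrow\;
\Delta_\delta \ge -\bar\delta_0^2,
\qquad\text{so that}\qquad
-\bar\delta_0^2 \;\le\; \Delta_\delta \;\le\; \min_i \frac{2 c_i^t}{d_i^t}.
```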

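Similarly, the expression (12) for $\boldsymbol{\mu}_t$ becomes

$$\boldsymbol{\mu}_t = E_t + F_t \Delta_\mu + G_t \Delta_\delta + O(\Delta^2), \tag{18}$$

where $\Delta^2$ denotes second-order terms in the deviation variables
$\Delta_\mu$ and $\Delta_\delta$, and $E_t$, $F_t$, and $G_t$ are the $N \times 1$ vectors

$$E_t = \bar\mu_0 \mathbf{1}_N + A_t^{-1} H_t^T \left(r^t - H_t \bar\mu_0 \mathbf{1}_N\right), \tag{19}$$

$$F_t = \mathbf{1}_N - A_t^{-1} H_t^T H_t \mathbf{1}_N, \tag{20}$$

$$G_t = -A_t^{-1} B A_t^{-1} \left(H_t^T r^t - n^t \bar\mu_0\right). \tag{21}$$

Define $e_i^t$, $f_i^t$, and $g_i^t$ as the $i$-th components of $E_t$, $F_t$, and
$G_t$, respectively. Then the linearized heuristic is

$$\frac{Q_i^t \log t}{\nu} \approx \theta^T x_i^t, \tag{22}$$

where the parameters $\theta$ are defined by

$$\theta_1 = \frac{1}{\nu}, \qquad \theta_2 = \frac{\Delta_\mu}{\nu}, \qquad \theta_3 = \frac{\Delta_\delta}{\nu}, \tag{23}$$

and the explanatory variables $x_i^t$ are defined as

$$x_{i1}^t = \left(e_i^t + \sqrt{c_i^t}\, \Phi^{-1}(1 - \alpha_t)\right) \log t, \tag{24}$$

$$x_{i2}^t = f_i^t \log t, \tag{25}$$

$$x_{i3}^t = \left(g_i^t - \frac{d_i^t\, \Phi^{-1}(1 - \alpha_t)}{2\sqrt{c_i^t}}\right) \log t. \tag{26}$$

The linearized heuristic (22) defines a softmax decision-making
model with a linear objective function.
Thus we can apply our estimation algorithm to estimate
the parameters $\theta$. Using (23) we can then use the estimate of
$\theta$ to provide an estimate of the parameters $(\mu_0, \sigma_0^2, \nu)$.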
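A quick way to build confidence in a linearization of this kind is to compare it numerically with the nonlinear objective. The sketch below does so under stated assumptions: it takes $A_t = \bar\delta_0^2 \Lambda + \operatorname{diag}(n^t)$ and $B = \Lambda$, uses the information form of the posterior mean (equivalent to the batch formula when the prior mean is uniform), and uses per-arm reward sums in place of $H_t^T r^t$. The arm layout, counts, and reward sums are invented for illustration. If the expansion is correct to first order, halving the deviations should cut the error by roughly a factor of four.

```python
import math
from statistics import NormalDist

import numpy as np

def phi_inv(t):
    """Phi^{-1}(1 - alpha_t) with alpha_t = 1 / (sqrt(2 pi e) t)."""
    return NormalDist().inv_cdf(1.0 - 1.0 / (math.sqrt(2.0 * math.pi * math.e) * t))

def exact_objective(mu0, delta_sq, Lam, n, sums, sigma_s, t, nu):
    """(Q_i^t log t) / nu from the nonlinear model."""
    A_inv = np.linalg.inv(delta_sq * Lam + np.diag(n))
    mu_t = mu0 + A_inv @ (sums - n * mu0)            # posterior mean
    sd_t = sigma_s * np.sqrt(np.diag(A_inv))         # posterior std devs
    return (mu_t + sd_t * phi_inv(t)) * math.log(t) / nu

def linearized_features(mu0_bar, delta_sq_bar, Lam, n, sums, sigma_s, t):
    """Explanatory variables x_i^t, one row per arm, at the nominal prior."""
    A_inv = np.linalg.inv(delta_sq_bar * Lam + np.diag(n))
    ABA = A_inv @ Lam @ A_inv
    e = mu0_bar + A_inv @ (sums - n * mu0_bar)
    f = 1.0 - A_inv @ n
    g = -ABA @ (sums - n * mu0_bar)
    c = sigma_s**2 * np.diag(A_inv)
    d = sigma_s**2 * np.diag(ABA)
    q, logt = phi_inv(t), math.log(t)
    return np.column_stack([(e + np.sqrt(c) * q) * logt,
                            f * logt,
                            (g - d * q / (2.0 * np.sqrt(c))) * logt])

# Hypothetical setup: five arms on a line, six past choices, decision time t = 7.
z = np.arange(5.0)
Lam = np.linalg.inv(np.exp(-np.abs(z[:, None] - z[None, :]) / 1.0))
n = np.array([2.0, 0.0, 1.0, 3.0, 0.0])              # visit counts
sums = np.array([330.0, 0.0, 158.0, 465.0, 0.0])     # per-arm reward sums
sigma_s, nu, t = 1.0, 4.0, 7
mu0_bar, delta_sq_bar = 150.0, 0.5

def linearization_error(d_mu, d_delta):
    theta = np.array([1.0, d_mu, d_delta]) / nu
    approx = linearized_features(mu0_bar, delta_sq_bar, Lam, n, sums, sigma_s, t) @ theta
    exact = exact_objective(mu0_bar + d_mu, delta_sq_bar + d_delta,
                            Lam, n, sums, sigma_s, t, nu)
    return np.max(np.abs(approx - exact)), np.max(np.abs(exact))

err_full, scale = linearization_error(1.0, 0.01)
err_half, _ = linearization_error(0.5, 0.005)
print(err_full, err_half, scale)
```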





D. Example estimations 

We tested the estimation procedure described above by
simulating runs of the stochastic UCL algorithm for various
parameter values. Figures 4 and 5 show two examples of estimates
computed using simulated data from the stochastic UCL
algorithm with the nonlinear objective function $(Q_i^t \log t)/\nu$
and true parameters $(\mu_0, \sigma_0^2, \lambda, \nu) = (200, 1, 1, 4)$. These
parameters result in the algorithm achieving high performance
(specifically, logarithmic regret; see [24] for details). Figure
4 shows estimates based on linearization about the point
$(\bar\mu_0, \bar\sigma_0^2) = (150, 2)$. Following (23), the linearized objective
function corresponds to parameters $\theta_1$, $\theta_2$, and $\theta_3$ having true
values $\theta_1 = 1/\nu = 0.25$, $\theta_2 = \Delta_\mu/\nu = 12.5$, and
$\theta_3 = \Delta_\delta/\nu = 1.25 \times 10^{-3}$. These are the values to which the estimates should
converge. Figure 5 shows estimates based on linearization
about the point $(\bar\mu_0, \bar\sigma_0^2) = (250, 0.5)$. The linearized objective
function in this case corresponds to the three parameters taking
true values $\theta_1 = 0.25$, $\theta_2 = -12.5$, and $\theta_3 = -2.5 \times 10^{-3}$.
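The mapping from a true parameter value and a linearization point to the linearized parameters can be checked with a few lines of arithmetic. The sketch below assumes $\theta_1 = 1/\nu$, $\theta_2 = \Delta_\mu/\nu$, and $\theta_3 = \Delta_\delta/\nu$ with $\delta_0^2 = \sigma_s^2/\sigma_0^2$. Because $\theta_3$ is proportional to the reward variance $\sigma_s^2$, whose value for these simulations is not given in this section, that variance is left as an argument with a placeholder default of 1; only $\theta_1$, $\theta_2$, and ratios of $\theta_3$ values are independent of it.

```python
def linearized_parameters(mu0, sigma0_sq, nu, mu0_bar, sigma0_sq_bar, sigma_s_sq=1.0):
    """Return (theta_1, theta_2, theta_3) for true and nominal prior values.

    sigma_s_sq is a placeholder: theta_3 scales linearly with it.
    """
    d_mu = mu0 - mu0_bar                                        # deviation in the prior mean
    d_delta = sigma_s_sq / sigma0_sq - sigma_s_sq / sigma0_sq_bar  # deviation in delta_0^2
    return 1.0 / nu, d_mu / nu, d_delta / nu

# True parameters (mu_0, sigma_0^2, nu) = (200, 1, 4), two linearization points.
first = linearized_parameters(200.0, 1.0, 4.0, mu0_bar=150.0, sigma0_sq_bar=2.0)
second = linearized_parameters(200.0, 1.0, 4.0, mu0_bar=250.0, sigma0_sq_bar=0.5)
print(first, second)
```

The first two components match the values quoted in the text for both linearization points, and the ratio of the two $\theta_3$ values is $-2$ regardless of $\sigma_s^2$, in agreement with the quoted mantissas 1.25 and $-2.5$.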

In both cases the estimator converges to the true value of
$\theta$ within the horizon T = 100 of the decision task. Further,
the true value of the parameter is within the 95% confidence
interval after 30 observed choices. This result has two
implications. First, the estimation procedure is at least
somewhat robust to the choice of linearization point for this
set of algorithm parameters. Second, the estimator is useful
for realistic empirical data sets, such as those reported in [24]
and studied in the following section. For these data sets the
horizon is T = 90 choices. For this amount of data, the
simulations show that the estimation procedure can identify
the true value of the parameter in a statistically significant
way. This result is valuable because the rigorous convergence
result from Theorem 1 does not directly guarantee convergence
in the more general case of nonlinear objective functions.

The amount of data required to get a reliable estimate
can depend on the true value of the algorithm parameters,
as shown in Figure 6. In this case, the true values of the
algorithm parameters are $(\mu_0, \sigma_0^2, \lambda, \nu) = (30, 10^3, 0, 0.5)$ and
the linearization is made about the point $(\bar\mu_0, \bar\sigma_0^2) = (40, 950)$.
The linearized objective function corresponds to the three
parameters taking true values $\theta_1 = 2$, $\theta_2 = -20$, and
$\theta_3 = -1.05 \times 10^{-6}$. With the true values of the prior in
the algorithm, the agent is sufficiently uncertain about the
rewards that it makes most of its initial 100 choices at random
in order to gain information about the rewards. This choice
behavior results in low performance (specifically, linear regret;
see [24] for details). Since the initial choices are effectively
made at random, they do not provide useful information
about the parameter values (except that they represent some
combination of an uncertain prior and high decision noise).
The uncertainty in the parameter values can be seen from the
width of the confidence interval around the mean parameter
estimates shown in Figure 6. For $\theta_1$ and $\theta_2$ the width is
many orders of magnitude larger than the magnitude of the
parameter, and the intervals are not displayed. For $\theta_3$, the estimate
exhibits persistent bias away from the true value, but the width
of the associated confidence interval is significantly larger
than the bias. Therefore, for such parameter values, one must
observe more data to be able to shrink the confidence intervals
and provide precise estimates of the parameter values.

Fig. 4. Estimate of the vector of parameters $\theta$ based on simulated data
from the stochastic UCL algorithm. The linearization point was taken to be
$\bar\mu_0 = 150$, $\bar\sigma_0^2 = 2$. The true algorithm parameters were $\mu_0 = 200$, $\sigma_0^2 = 1$,
$\lambda = 1$, and $\nu = 4$. The estimate converges as the number of observations
t grows. The dashed lines show the true value of each parameter $\theta_i$. For each
value of t, an ensemble of 100 parameter estimates was formed by repeatedly
simulating the data $\{(x_t, y_t)\}$ while holding the parameters $\theta$ fixed, and
using the estimator to compute the value of the parameters. The solid lines
show the mean parameter estimate and the 95% confidence interval implied
by the asymptotic normal distribution.

E. Discussion 

The linearization procedure described above yields a local
linear approximation to the likelihood maximization problem,
and Theorem 1 provides conditions under which the local
approximation results in an identified model with a convex
optimization problem. However, the effectiveness of the procedure
is sensitive to the choice of nominal prior $(\bar\mu_0, \bar\delta_0^2)$ about
which to linearize. The linearization point should be chosen
such that the linear approximation is valid at the (unknown)
true value of the parameters. In the worst case, there might not
be any intuition for choosing the linearization point, making
the above procedure no better than any other local optimization
technique for which a starting point must be chosen.

Fortunately, there are several advantageous aspects of the
problem. The first, which is generic to any heuristic function,
is that the likelihood function forms a unique objective
for judging the “goodness” of the estimated parameter.
Without knowing in advance a good choice of linearization
point, one approach is to perform the estimation assuming
two different choices of linearization points and to compare
the resulting estimates $\hat\theta$. If the two linearization points result
in identical estimates there is no conflict, while if the estimates
differ, the one with the higher likelihood value is better.

Second, there may be intuition about an appropriate choice
of linearization point due to the structure of the model. In
[24], we showed that behavior of the stochastic UCL model
falls broadly into three classes as a function of the parameters
$(\sigma_0^2, \lambda, \nu)$. Thus, by categorizing a given data set into one
of the three classes, we narrow the search for a linearization
point to the associated regions of parameter space. And, as
we saw in Figures 4 and 5, the stochastic UCL model appears
to be relatively insensitive to the choice of linearization point
within the region of parameter space associated with a given
behavioral class. In the following section we exploit this
intuition to estimate the parameters of the stochastic UCL
algorithm based on data from a human subject experiment.

Fig. 5. Estimate of the vector of parameters $\theta$ based on simulated data from
the stochastic UCL algorithm. Everything is the same as in Figure 4 except
that the linearization point was taken to be $\bar\mu_0 = 250$, $\bar\sigma_0^2 = 0.5$.

Fig. 6. Estimate of the vector of parameters $\theta$ based on simulated data
from the UCL algorithm with a weakly-informative prior. This prior makes
the algorithm’s choice behavior more random, which makes the estimation
problem more difficult. Everything is the same as in Figure 4 except that
the linearization point was taken to be $\bar\mu_0 = 40$, $\bar\sigma_0^2 = 950$ and the true
algorithm parameters were $\mu_0 = 30$, $\sigma_0^2 = 10^3$, $\lambda = 0$, and $\nu = 0.5$. The
95% confidence interval implied by the asymptotic normal distribution is
shown only in the plot of $\theta_3$. For parameters $\theta_1$ and $\theta_2$, the widths of the
confidence intervals are much greater than the magnitudes of the parameter
estimates and are omitted for legibility.

VII. Application to experimental data

In this section we apply the estimator to fit the stochastic
UCL model (14) to experimental data studied in [24]. By
fit, we refer to the process of selecting a nominal parameter
for linearization and applying the estimator to the linearized
model. The parameter estimates produced by the fitting procedure
show that individuals with high performance match their
behavior to the task in a statistically significant way.



A. Experimental setup 

This section reviews the experimental setup as presented
in Reverdy et al. [2]. As described in [2], we collected
data from a human subject experiment where we ran multi-armed
bandit tasks through web servers at Princeton University
(Princeton, NJ, USA) following protocols approved by
the Princeton University Institutional Review Board. Human
participants were recruited using Amazon’s Mechanical Turk
(AMT) web-based task platform. Participants were shown
instructions that told them they would be playing a simple
game during which they could obtain points, and that their
goal was to obtain the maximum number of total points in
each part of the game.

Each participant was presented with a set of N = 100
options, presented as squares arranged in a 10 × 10 grid. See
Figure 7 for a visualization of the experimental interface. At
each decision time $t \in \{1, \ldots, T\}$, the participant made a
choice by moving the cursor to one square in the grid and
clicking. After each choice was made, a numerical reward
associated with that choice was reported on the screen. A
variety of aspects of the game, including timing, game dynamics,
and reward structures, were manipulated as part of
the experimental design. As a result of these manipulations,
only 326 of the 417 participants were assigned to a standard
multi-armed bandit task for which the stochastic UCL model
is appropriate. In the remainder of the section, we focus
exclusively on data from these 326 participants.

The mean value of the reward associated with choosing a
particular option i was $m_i$. Since the options were arranged
in a 10 × 10 grid, the set of mean values can be thought of as
a real-valued function on the discrete two-dimensional grid.
We refer to this function as the reward landscape, and prior
knowledge about the rewards in a given task corresponds to
prior knowledge about the landscape. Mean rewards in each
task corresponded to one of two landscapes: Landscape A and
Landscape B, shown in Figure 8. Each landscape was flat
along one dimension and followed a profile along the other
dimension. The profile of Landscape A was such that a simple
gradient-climbing strategy was likely to prove effective, while
Landscape B was constructed to require a more sophisticated
strategy. Each participant played the game with each landscape
once, presented in random order. Due to the structure of the
experimental design, only one of the two landscapes was
associated with a standard multi-armed bandit task.

Fig. 7. The experimental interface used in the human subject experiment.
Upon clicking on one of the 100 squares arranged in a 10 × 10 grid, the red
dot would move to the center of the square. The subject was free to select
a new square without penalty until the time allotted (1.5 or 6 seconds per
choice) had elapsed, at which time the blue dot would move to the center of
the selected square and the subject would receive a reward reported in the
text box at the top of the screen. Originally appeared as Figure 5 of [2];
reproduced with permission.

The participants’ performance in a given task can be
classified in terms of the growth rate of their cumulative
regret, which is a measure of cumulative loss relative to the
optimal decision (unknown to the subject). As reported in [24],
70 of the 326 participants, or approximately 21%, achieved
high performance while the remainder, approximately 79%,
achieved low performance. Of the 206 subjects assigned
to Landscape A, 53 achieved high performance. Likewise,
of the 120 subjects assigned to Landscape B, 17 achieved
high performance. The high-performing subjects outperformed
standard frequentist algorithms on the task, which we attribute
to the subjects’ having good priors about the task. Since we
did not explicitly convey prior knowledge about the reward
landscapes to the subjects, we postulate that they used priors
developed in the course of other spatial tasks encountered
in daily life. Considering the stochastic UCL algorithm as a
model of the subjects’ behavior, good priors correspond to
good values for the parameters $\boldsymbol{\mu}_0$ and $\Sigma_0$, which quantify
the subjects’ intuition about the task. To learn the priors we
propose estimating them from the data. The estimated priors
could then be used, e.g., to improve the performance of an
automated system.

B. Fitting 

In fitting the stochastic UCL model to human subject data,
we seek to answer two questions. First, what distinguishes the
decision-making of the subjects with high performance from
those with low performance? And second, do subjects adapt
their decision-making strategies to the task, i.e., the reward
landscape? Our experimental design provides data from only
one task per subject, so we cannot, for example, compare a
single subject’s performance on the different landscapes. Thus,
we analyze the data at the population level to answer the two questions.

Fig. 8. The two task reward landscapes: (a) Landscape A, (b) Landscape B.
The two-dimensional reward surfaces for the 10 × 10 set of options followed
the profile along one dimension (here the x direction) and were flat along
the other (here the y direction). The Landscape A profile is designed to be
simple in the sense that the surface is concave and there is only one global
maximum (x = 6), while the Landscape B profile is more complicated since
it features two local maxima (x = 1 and 10), only one of which (x = 10)
is the global maximum. Originally appeared as Figure 6 of [2]; reproduced
with permission.

Each subject is classified as having high or low performance
as described above. On the basis of this classification and the
reward landscape, the subject is assigned to one of the four
combined performance-landscape categories. We assume each
subject represents an independent and identically distributed
(iid) sample from the true parameter $\theta_0$ associated with its
category. We applied the estimator to data from each subject
using nominal parameters $(\bar\mu_0, \bar\sigma_0^2, \lambda) = (30, 10, 0.1)$ for
subjects with low performance and $(\bar\mu_0, \bar\sigma_0^2, \lambda) = (200, 10^6, 4)$
for subjects with high performance. We validated the choice of
$\lambda$ by performing estimation on the data from several subjects
using a variety of values of $\lambda$. The optimal value of $\lambda$ clearly
differed between the two categories of performance but the
estimates for each given performance category were fairly
robust to changes in the value of $\lambda$. The fitting procedure
produces a maximum likelihood estimate and associated
covariance matrix for each subject. By the iid assumption, it
is tenable to construct a population-level parameter estimate
for each of the four categories by appropriately averaging the
individual subjects’ estimates and covariances.

Table I presents the population-level parameter estimates,
along with the mean log likelihood values, for the four categories.
The columns labeled $\hat\theta$ report the maximum likelihood
parameter estimates and those labeled $\sigma$ their asymptotic standard
deviations implied by the asymptotic normal distribution. Recall
that these parameters
represent deviations from the nominal parameter values and
therefore are not directly comparable between performance
categories. However, comparing the magnitude of the standard
deviations shows that the parameter estimates are much more
precise for those categories associated with high performance.
This is consistent with our findings in Section VI-D. Table II
presents the maximum likelihood parameter estimates transformed
back into the original variables $\nu$, $\mu_0$, and $\sigma_0^2$; these
are directly comparable.
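The transformation back to the original variables can be reproduced with a short calculation. The sketch below inverts the definitions $\theta_1 = 1/\nu$, $\theta_2 = \Delta_\mu/\nu$, $\theta_3 = \Delta_\delta/\nu$ with $\delta_0^2 = \sigma_s^2/\sigma_0^2$, using the population-level estimates from Table I and the nominal values quoted above. Two inputs are assumptions rather than values stated in this section: the reward variance $\sigma_s^2 = 10$, chosen because it makes the output agree with Table II, and the reading of the high-performance nominal variance as $10^6$.

```python
def recover_parameters(theta, mu0_bar, sigma0_sq_bar, sigma_s_sq):
    """Map linearized estimates (theta_1, theta_2, theta_3) to (nu, mu_0, sigma_0^2)."""
    theta1, theta2, theta3 = theta
    nu = 1.0 / theta1
    mu0 = mu0_bar + theta2 * nu                      # nominal mean plus deviation
    delta_sq = sigma_s_sq / sigma0_sq_bar + theta3 * nu
    return nu, mu0, sigma_s_sq / delta_sq

SIGMA_S_SQ = 10.0  # assumed reward variance (not stated in this section)

# (theta estimates from Table I, nominal (mu_0, sigma_0^2)) per category.
table_i = {
    "low, A": ((0.360, -5.22, 0.433), (30.0, 10.0)),
    "low, B": ((0.252, -2.12, 0.213), (30.0, 10.0)),
    "high, A": ((3.93e-2, -6.86, 7.88e-7), (200.0, 1e6)),
    "high, B": ((3.39e-2, -6.57, 6.80e-7), (200.0, 1e6)),
}

recovered = {
    name: recover_parameters(theta, nominal[0], nominal[1], SIGMA_S_SQ)
    for name, (theta, nominal) in table_i.items()
}
for name, values in recovered.items():
    print(name, values)
```

Small differences from the tabulated values are expected because the inputs are rounded to three significant figures.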

Table II allows us to answer our first question about the
differences between subjects with different levels of performance.
The parameter values clearly differ more between
levels of performance than between landscapes. Between
levels of performance the parameters that differ the most
are the decision noise parameter $\nu$ and the prior uncertainty
$\sigma_0^2$. Larger values of $\nu$ are associated with more random
decision-making, while larger values of $\sigma_0^2$ represent greater
uncertainty about the rewards, which is associated with placing
a higher value on information. Both of these factors tend to
encourage exploration, and the values of both $\nu$ and $\sigma_0^2$ are
much greater for subjects with high performance than those
with low performance. Thus, for both landscapes, the high-performing
subjects explore more than the low-performing
ones, which presumably helps them discover the regions of
high rewards. Furthermore, the subjects with high performance
use correlated priors which allow them to quickly explore large
regions of the reward surface.

We can compare the quality of the model fits by comparing
the mean log likelihood values across categories provided in
Table I. Again, we see starker differences between levels of
performance than between landscapes. Between landscapes,
the fits are approximately equal in quality, while between
performance levels there is a substantial difference, equal to an
approximate doubling of the fitted model’s predictive power.
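The "doubling" can be made concrete with a little arithmetic. The sketch below assumes that each mean log likelihood is a sum over the T = 90 choices in a task, and reads "predictive power" as the geometric-mean probability the fitted model assigns to each observed choice; both are interpretations rather than definitions given in the text. The chance level for N = 100 options is included for comparison.

```python
import math

T = 90    # choices per task, as stated for these data sets
N = 100   # options per choice

# Mean log likelihood values per category, from Table I.
mean_loglik = {"low, A": -338.0, "low, B": -331.0,
               "high, A": -273.0, "high, B": -271.0}

# Geometric-mean probability assigned to each observed choice.
per_choice = {name: math.exp(value / T) for name, value in mean_loglik.items()}
chance = 1.0 / N

ratio = per_choice["high, A"] / per_choice["low, A"]
print(per_choice, chance, ratio)
```

Under this reading, the fitted model assigns roughly twice the per-choice probability to the observed choices of high-performing subjects as to those of low-performing subjects, and both are above the chance level of 0.01.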

Table I allows us to answer our second question about
the degree to which subjects match their strategies to the
task. We focus on comparing the parameters across landscape
conditions for each of the performance categories separately.
For low-performing subjects, comparing the relative magnitudes
of the parameter estimates and their standard deviations
suggests that there is no significant difference between the
two landscape conditions. The two-sided Welch’s t-test
confirms that the difference in the parameter estimates is
statistically insignificant. For high-performing subjects, the
parameter estimates are much more precise, and the two-sided
Welch’s t-test confirms that the difference in the parameter
estimates is statistically significant at the 95% confidence
level. In other words, the fitting procedure is able to distinguish
that the high-performing subjects have strategies that are
matched to the landscape.

TABLE I
Parameter estimates $\hat\theta$ and associated standard deviations $\sigma$
conditional on regret growth order and reward landscape.
The values for high performance are significantly different
between surfaces at the 95% confidence level (two-sided
Welch’s t-test); other comparisons show that the
parameter values do not significantly differ between classes.

Low (linear, power-law) performance
              Landscape A, 153 subj.         Landscape B, 103 subj.
              Mean log likelihood: -338      Mean log likelihood: -331
              $\hat\theta$    $\sigma$       $\hat\theta$    $\sigma$
$\theta_1$    0.360           90.4           0.252           1.32
$\theta_2$    -5.22           1.27e3         -2.12           51.8
$\theta_3$    0.433           1.02e2         0.213           8.61

High (log-law) performance
              Landscape A, 53 subj.          Landscape B, 17 subj.
              Mean log likelihood: -273      Mean log likelihood: -271
              $\hat\theta$    $\sigma$       $\hat\theta$    $\sigma$
$\theta_1$    3.93e-2         1.18e-3        3.39e-2         1.04e-3
$\theta_2$    -6.86           0.226          -6.57           0.268
$\theta_3$    7.88e-7         2.34e-8        6.80e-7         2.06e-8
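To show how such a comparison works in practice, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for the first parameter component in each performance class. It treats the tabulated standard deviations as standard errors of the population-level estimates and tests one component at a time; the text does not say exactly how the test was applied, so both choices are assumptions made for illustration.

```python
import math

def welch_statistic(est_a, se_a, n_a, est_b, se_b, n_b):
    """Welch's t statistic and approximate degrees of freedom.

    se_a and se_b are treated as standard errors of the two estimates.
    """
    var_sum = se_a**2 + se_b**2
    t_stat = (est_a - est_b) / math.sqrt(var_sum)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = var_sum**2 / (se_a**4 / (n_a - 1) + se_b**4 / (n_b - 1))
    return t_stat, dof

# First component, Landscape A versus Landscape B, values from Table I.
t_high, dof_high = welch_statistic(3.93e-2, 1.18e-3, 53, 3.39e-2, 1.04e-3, 17)
t_low, dof_low = welch_statistic(0.360, 90.4, 153, 0.252, 1.32, 103)
print(t_high, dof_high, t_low, dof_low)
```

For the high-performance class the statistic is well above the two-sided 95% critical value, which is roughly 2 for this many degrees of freedom, while for the low-performance class it is close to zero; this matches the pattern described in the text for this component.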

C. Implications for human-centered automation 

The results of the fitting exercise have several implications
for human-centered automation. First, they demonstrate an
estimator for a model of human decision-making behavior. The
estimator allows one to quantify a human subject’s intuition
in a statistically powerful way. Second, the model fits are
of higher quality for subjects with high performance. This
suggests that the stochastic UCL model is better suited to the
decision-making behavior of subjects who are experts at the
task; a different model may be more appropriate for lower-performing
subjects. Third, subjects with high performance
seem to have effective priors: these priors have low certainty
(large values of $\sigma_0^2$), but exploit correlation in the rewards
due to the smoothness of the reward landscapes by using
positive values of the length scale parameter $\lambda$. When such
correlation structures exist, they can be exploited to greatly
improve performance [29], as our human subjects appear to
have done. The estimator provides a way to learn effective
priors from a human operator. In the absence of a correlation
structure, the above fitting process can still be applied by
setting $\lambda = 0$, although convergence of the estimator will
be slower, requiring longer series of choice data than those
studied here.

By analyzing data from a human subject experiment, we 
have shown the effectiveness of the linearization procedure for 
extending the estimator to a model with a nonlinear objective 
function. The known asymptotic properties of the estimator 
allowed us to perform tests for statistical significance and find 
differences in behavior. 

VIII. Conclusion 

Motivated by the parameter estimation problem for 
decision-making models, we studied the maximum likelihood 
parameter estimation problem for softmax decision-making 
models with linear objective functions. Such models occur fre¬ 
quently in the neuroscience and machine learning literatures. 





Low (linear, power-law) performance 


Landscape A, 153 subj. 

Landscape B, 103 subj 

Parameter 

Value 

Pai'ameter 

Value 

V 

2.78 

V 

3.97 

/^O 

15.5 

MO 

21.6 


4.54 


5.42 


High (log-law) performance 


Landcape A, 53 subj. 

Landscape B, 17 subj. 

Parameter 

Value 

Parameter 

Value 

V 

25.5 

V 

29.5 

Mo 

25.3 

Mo 

6.08 


3.32e5 


3.35e5 


TABLE II 

Parameter estimates v, ho, tq and associated standard 

DEVIATIONS a CONDITIONAL ON REGRET GROWTH ORDER AND REWARD 
LANDSCAPE. 

We derived conditions under which the maximum likelihood 
estimator converges to the correct parameter values, characterized 
the estimator’s asymptotic distribution, and showed how 
to use this distribution to formulate confidence intervals for 
the parameter estimates. 
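
The following Python sketch illustrates the kind of procedure summarized above: maximum likelihood estimation for a softmax choice model whose objective is linear in the unknown parameters, followed by approximate confidence intervals from the inverse Hessian of the negative log-likelihood. The array layout `X` of shape (decisions, options, parameters), all function names, and the use of Newton's method with backtracking are assumptions made for illustration; any temperature or cooling schedule is taken to be absorbed into the features.

```python
import numpy as np

def choice_probs(theta, X):
    """Softmax choice probabilities for the linear objective Q = X @ theta.
    X has shape (T, K, p): T decisions, K options, p parameters."""
    Q = X @ theta
    Q = Q - Q.max(axis=1, keepdims=True)      # stabilize the exponentials
    P = np.exp(Q)
    return P / P.sum(axis=1, keepdims=True)

def neg_log_lik(theta, X, choices):
    P = choice_probs(theta, X)
    return -np.log(P[np.arange(len(choices)), choices]).sum()

def grad_and_hessian(theta, X, choices):
    P = choice_probs(theta, X)
    mean_x = np.einsum("tk,tkp->tp", P, X)            # E[x] under the model
    grad = (mean_x - X[np.arange(len(choices)), choices]).sum(axis=0)
    second = np.einsum("tk,tkp,tkq->pq", P, X, X)     # sum_t E[x x^T]
    hess = second - mean_x.T @ mean_x                 # sum_t Cov[x], PSD
    return grad, hess

def fit_softmax_mle(X, choices, max_iter=100, tol=1e-9):
    theta = np.zeros(X.shape[2])
    for _ in range(max_iter):
        grad, hess = grad_and_hessian(theta, X, choices)
        step = np.linalg.solve(hess, grad)            # Newton direction
        scale, f0 = 1.0, neg_log_lik(theta, X, choices)
        # Halve the step until the convex objective does not increase.
        while neg_log_lik(theta - scale * step, X, choices) > f0 and scale > 1e-8:
            scale *= 0.5
        theta = theta - scale * step
        if np.linalg.norm(scale * step) < tol:
            break
    _, hess = grad_and_hessian(theta, X, choices)
    cov = np.linalg.inv(hess)       # approximate asymptotic covariance
    return theta, cov

def confidence_intervals(theta, cov, z=1.96):
    """Approximate 95% intervals (for z=1.96) from the normal approximation."""
    half = z * np.sqrt(np.diag(cov))
    return np.column_stack([theta - half, theta + half])

def simulate_choices(theta, X, rng):
    """Draw one choice per decision from the softmax model (for testing)."""
    P = choice_probs(theta, X)
    u = rng.random(len(P))
    idx = (P.cumsum(axis=1) < u[:, None]).sum(axis=1)
    return np.minimum(idx, P.shape[1] - 1)
```

The Hessian here is a sum of feature covariance matrices under the model's choice probabilities, which is why the negative log-likelihood is convex; the intervals are only meaningful when that matrix is well conditioned, that is, when the observed choices carry information about every parameter.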

We then showed that the stochastic UCL algorithm could 
be transformed into a softmax decision-making model with a 
linear objective function by linearizing the objective function 
about a nominal point in parameter space. By performing 
parameter estimation on the linearized model using simulated 
data, we showed that we could recover the true values of 
the stochastic UCL algorithm parameters. The amount of 
data required to perform useful estimation depends on the 
region of parameter space, with parameters representing priors 
that strongly influence behavior (e.g., small variances $\sigma_0^2$, 
representing strong beliefs, or large correlation length scales $\lambda$, 
representing highly structured beliefs) being easier to estimate. 
This is intuitive, as observed behavior will be more sensitive 
to such influential beliefs. 
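
The linearization step can be sketched as follows: a nonlinear objective is replaced by its first-order Taylor expansion about a nominal parameter value, so that the deviation from the nominal value enters linearly. The finite-difference Jacobian, the function names, and the toy objective below are hypothetical stand-ins chosen for illustration; they are not the stochastic UCL objective.

```python
import numpy as np

def linearize(f, theta0, eps=1e-6):
    """First-order expansion of f about theta0.

    Returns (q_lin, f0, J), where q_lin(theta) = f0 + J @ (theta - theta0)
    and J is a central-difference approximation of the Jacobian.
    """
    theta0 = np.asarray(theta0, dtype=float)
    f0 = np.asarray(f(theta0), dtype=float)
    J = np.empty((f0.size, theta0.size))
    for j in range(theta0.size):
        d = np.zeros_like(theta0)
        d[j] = eps
        J[:, j] = (f(theta0 + d) - f(theta0 - d)) / (2 * eps)

    def q_lin(theta):
        return f0 + J @ (np.asarray(theta, dtype=float) - theta0)

    return q_lin, f0, J

# Hypothetical toy objective: one value per option, nonlinear in `length`.
DIST = np.array([0.0, 1.0, 2.0, 3.0])

def toy_objective(theta):
    amplitude, length = theta
    return amplitude * np.exp(-DIST / length)
```

In the linearized model, the rows of `J` act as the feature vectors of the options, the deviation `theta - theta0` is the unknown, and `f0` is a known offset. The approximation is only trustworthy near the nominal point, so in practice one might re-linearize about the current estimate and repeat.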

The estimator convergence results we state in Theorem 1 
are specific to the case where the objective function is a linear 
function of the unknown parameters. However, we showed 
how the estimation procedure can be extended to nonlinear 
objective functions with linearization. Using the linearization 
technique with the estimator, we fit the stochastic UCL model 
developed in [24] to data from a human subject experiment. 
The estimates show a statistically significant difference in 
behavior between subjects who exhibit good performance in 
related but distinct tasks. Quantifying these differences is 
of interest both for the science of decision-making and 
for the development of automation technology. In conjunction 
with the stochastic UCL model, the estimator developed in 
this paper provides the tools for quantifying human decision-making 
behavior in multi-armed bandit problems. These tools 
will facilitate the principled development of human-machine 
decision-making teams. 